HuMo is a unified, human-centered video generation framework that produces high-quality, fine-grained, and controllable human videos from multimodal inputs such as text, images, and audio. It follows text prompts closely, keeps the subject's appearance consistent across frames, and generates motion synchronized with the input audio.
Tags: Multimodal, GGUF